Obesity is a growing global health challenge. This study develops and compares four tree-based machine learning models that predict obesity classification from demographic and behavioral data: single decision trees using deviance and Gini splitting criteria, bagging, and random forest. The dataset contains 531 observations with 15 predictor variables, including age, height, water intake, eating frequency, physical activity, and family history. Random forest emerged as the best model, with 31.8% total error and an AUC of 0.75, compared to 33.6% error for single decision trees. Age and height are the strongest predictors of obesity, followed by behavioral factors such as water intake and eating frequency. The main challenge is classifying overweight cases, owing to class imbalance and overlap with adjacent obesity categories. Overall, demographic and lifestyle factors predict obesity reasonably well, but accuracy varies across obesity classes.
Obesity is a major public health challenge. The World Health Organization (WHO) reports that global obesity prevalence has nearly tripled since 1975. Current data indicates that over one billion people live with obesity, including approximately 890 million adults and 160 million children and adolescents. Incidence rates are rising in low- and middle-income countries, where healthcare systems often face a dual burden of malnutrition and metabolic disease. This condition drives non-communicable diseases, including type 2 diabetes, cardiovascular disease, and cancer, through mechanisms such as systemic inflammation and insulin resistance. The economic impact of obesity is substantial. The World Obesity Federation projects that the global cost of overweight and obesity will reach $4.32 trillion annually by 2035, equivalent to nearly 3% of global GDP. Indirect costs, stemming from lost workforce productivity, absenteeism, and premature mortality, often exceed direct medical expenditures. Consequently, the failure to address obesity affects national economies alongside healthcare systems. Addressing this challenge requires moving from reactive management to early identification. Current screening relies on Body Mass Index (BMI), a lagging indicator that may not capture body composition nuances or identify risk before weight gain occurs. Predictive tools that identify at-risk individuals based on modifiable behaviors, such as diet and physical activity, are necessary. This study addresses this gap by developing and benchmarking machine learning classification models to predict obesity status using demographic and lifestyle characteristics.
The dataset originates from the study by Palechor and de la Hoz Manotas (2019), titled “Dataset for estimation of obesity levels based on eating habits and physical condition in individuals from Colombia, Peru and Mexico.” Data was collected via an online questionnaire administered to individuals aged 14-61 years residing in Colombia, Peru, and Mexico. The original dataset contains 2,111 observations. However, approximately 75% of these observations were synthetically generated using SMOTE (Synthetic Minority Over-sampling Technique) to balance the obesity classes. This synthetic augmentation is a critical consideration addressed in our data preparation.
The following table describes all variables in the dataset:
| Variable | Type | Description | Levels/Units |
|---|---|---|---|
| Demographic Variables | |||
| Gender | Categorical | Biological sex | Female, Male |
| Age | Numeric | Age in years | 14-61 years |
| Height | Numeric | Height | Meters |
| Weight | Numeric | Weight | Kilograms |
| Eating Habits | |||
| FAVC | Binary | Frequent consumption of high-calorie food | No, Yes |
| FCVC | Ordinal | Frequency of vegetable consumption | 1=Never, 2=Sometimes, 3=Always |
| NCP | Ordinal | Number of main meals per day | 1, 2, 3, 4+ |
| CAEC | Categorical | Consumption of food between meals | No, Sometimes, Frequently, Always |
| CH2O | Ordinal | Daily water consumption | 1=Low, 2=Medium, 3=High |
| CALC | Categorical | Alcohol consumption frequency | No, Sometimes, Frequently, Always |
| Lifestyle Variables | |||
| SMOKE | Binary | Smoking status | No, Yes |
| SCC | Binary | Calorie consumption monitoring | No, Yes |
| FAF | Ordinal | Physical activity frequency | 0=None, 1=1-2 days, 2=2-4 days, 3=4+ days |
| TUE | Ordinal | Time using technology devices | 0=0-3h, 1=3-5h, 2=5h+ |
| MTRANS | Categorical | Transportation used | Automobile, Bike, Motorbike, Public Transport, Walking |
| family_history_with_overweight | Binary | Family history of overweight | No, Yes |
| Target Variable | |||
| NObeyesdad | Categorical | Obesity level classification | 7 levels (Insufficient Weight to Obesity Type III) |
This analysis addresses the following research question:
What demographic and behavioral factors best predict obesity risk, and which machine learning method provides the most accurate classification?
library(tree)
library(randomForest)
library(pROC)
library(ggplot2)
library(knitr)
library(kableExtra)
set.seed(123)
The original dataset contains both real survey responses and synthetically generated observations (via SMOTE). Using synthetic data may introduce artifacts that do not reflect human behavior patterns. To ensure our findings are valid and generalizable, we identify and filter out synthetic observations.
Identification Method: Real survey responses have integer values for variables FCVC, NCP, CH2O, FAF, and TUE, while SMOTE-generated records contain decimal values due to the interpolation process.
obesity <- read.csv("ObesityDataSet_raw_and_data_sinthetic.csv")
is_decimal <- function(x) { return((x %% 1) != 0)}
obesity$data_type <- "Real"
synthetic_indices <- which(
  is_decimal(obesity$FCVC) | is_decimal(obesity$NCP) | is_decimal(obesity$CH2O) |
  is_decimal(obesity$FAF) | is_decimal(obesity$TUE))
obesity$data_type[synthetic_indices] <- "Synthetic"
table(obesity$data_type)
     Real Synthetic
      531      1580
prop.table(table(obesity$data_type)) * 100
     Real Synthetic
 25.15396  74.84604
We proceed with only the 531 real observations to ensure validity of findings.
obesity_real <- obesity[obesity$data_type == "Real", ]
All categorical variables must be properly converted to R factors for modeling. This ensures correct treatment in tree-based methods.
# Convert binary variables: FAVC, SMOKE, SCC, family_history, Gender
obesity_real$FAVC <- factor(obesity_real$FAVC, levels = c("no", "yes"))
obesity_real$SMOKE <- factor(obesity_real$SMOKE, levels = c("no", "yes"))
obesity_real$SCC <- factor(obesity_real$SCC, levels = c("no", "yes"))
obesity_real$family_history_with_overweight <- factor(obesity_real$family_history_with_overweight,levels = c("no", "yes"))
obesity_real$Gender <- factor(obesity_real$Gender, levels = c("Female", "Male"))
# Convert ordinal variables: FCVC, CH2O, FAF, TUE
obesity_real$FCVC <- factor(obesity_real$FCVC, levels = c(1, 2, 3), ordered = TRUE)
obesity_real$CH2O <- factor(obesity_real$CH2O, levels = c(1, 2, 3), ordered = TRUE)
obesity_real$FAF <- factor(obesity_real$FAF, levels = c(0, 1, 2, 3), ordered = TRUE)
obesity_real$NCP <- factor(obesity_real$NCP, levels = c(1, 2, 3, 4), ordered = TRUE)
obesity_real$TUE <- factor(obesity_real$TUE, levels = c(0, 1, 2), ordered = TRUE)
# Convert nominal variables: CAEC, MTRANS, NObeyesdad
obesity_real$CAEC <- factor(obesity_real$CAEC, levels = c("no", "Sometimes", "Frequently", "Always"))
obesity_real$CALC <- factor(obesity_real$CALC, levels = c("no", "Sometimes", "Frequently", "Always"))
obesity_real$MTRANS <- factor(obesity_real$MTRANS, levels = c("Automobile", "Bike", "Motorbike", "Public_Transportation", "Walking"))
obesity_real$NObeyesdad <- factor(obesity_real$NObeyesdad, levels = c("Insufficient_Weight", "Normal_Weight", "Overweight_Level_I", "Overweight_Level_II", "Obesity_Type_I", "Obesity_Type_II", "Obesity_Type_III"))
str(obesity_real)
'data.frame': 531 obs. of 18 variables:
$ Gender : Factor w/ 2 levels "Female","Male": 1 1 2 2 2 2 1 2 2 2 ...
$ Age : num 21 21 23 27 22 29 23 22 24 22 ...
$ Height : num 1.62 1.52 1.8 1.8 1.78 1.62 1.5 1.64 1.78 1.72 ...
$ Weight : num 64 56 77 87 89.8 53 55 53 64 68 ...
$ family_history_with_overweight: Factor w/ 2 levels "no","yes": 2 2 2 1 1 1 2 1 2 2 ...
$ FAVC : Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 2 1 2 2 ...
$ FCVC : Ord.factor w/ 3 levels "1"<"2"<"3": 2 3 2 3 2 2 3 2 3 2 ...
$ NCP : num 3 3 3 3 1 3 3 3 3 3 ...
$ CAEC : Factor w/ 4 levels "no","Sometimes",..: 2 2 2 2 2 2 2 2 2 2 ...
$ SMOKE : Factor w/ 2 levels "no","yes": 1 2 1 1 1 1 1 1 1 1 ...
$ CH2O : Ord.factor w/ 3 levels "1"<"2"<"3": 2 3 2 2 2 2 2 2 2 2 ...
$ SCC : Factor w/ 2 levels "no","yes": 1 2 1 1 1 1 1 1 1 1 ...
$ FAF : Ord.factor w/ 4 levels "0"<"1"<"2"<"3": 1 4 3 3 1 1 2 4 2 2 ...
$ TUE : Ord.factor w/ 3 levels "0"<"1"<"2": 2 1 2 1 1 1 1 1 2 2 ...
$ CALC : chr "no" "Sometimes" "Frequently" "Frequently" ...
$ MTRANS : chr "Public_Transportation" "Public_Transportation" "Public_Transportation" "Walking" ...
$ NObeyesdad : chr "Normal_Weight" "Normal_Weight" "Normal_Weight" "Overweight_Level_I" ...
$ data_type : chr "Real" "Real" "Real" "Real" ...
Several categorical variables contain levels with very few observations, which can cause modeling issues. We merge these into meaningful grouped categories.
Original MTRANS Counts:
Automobile Bike Motorbike Public_Transportation Walking
107 7 11 351 55
obesity_real$MTRANS_3 <- as.character(obesity_real$MTRANS)
obesity_real$MTRANS_3[obesity_real$MTRANS %in% c("Bike", "Walking")] <- "Active"
obesity_real$MTRANS_3[obesity_real$MTRANS %in% c("Automobile", "Motorbike")] <- "Private_Motor"
obesity_real$MTRANS_3[obesity_real$MTRANS == "Public_Transportation"] <- "Public_Transport"
obesity_real$MTRANS_3 <- factor(obesity_real$MTRANS_3, levels = c("Active", "Private_Motor", "Public_Transport"))
cat("\nMerged MTRANS_3 Counts:\n")
Merged MTRANS_3 Counts:
Active Private_Motor Public_Transport
62 118 351
Rationale: The categories “Bike” and “Motorbike” had extremely low frequencies, leading to sparsity. To address this, we grouped categories based on physical exertion: “Active” (Walking and Bike) and “Private Motor” (Automobile and Motorbike).
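The same sparsity fix can be expressed as a generic base-R helper. This is an illustrative sketch: `collapse_rare`, `min_n`, and `other` are names of our own, and the actual analysis merges by exertion level rather than purely by count.

```r
# Illustrative helper (not part of the original pipeline): collapse every
# factor level observed fewer than `min_n` times into a single bucket.
collapse_rare <- function(x, min_n = 30, other = "Other") {
  x <- as.character(x)
  rare <- names(which(table(x) < min_n))
  x[x %in% rare] <- other
  factor(x)
}

# Reconstruct the MTRANS counts reported above
mtrans <- rep(c("Automobile", "Bike", "Motorbike", "Public_Transportation", "Walking"),
              times = c(107, 7, 11, 351, 55))
table(collapse_rare(mtrans, min_n = 30))
```

With a count threshold of 30, Bike and Motorbike (7 and 11 observations) would be folded together, while the four larger categories survive unchanged.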
Original NCP Counts:
1 3 4
124 359 48
obesity_real$NCP_3 <- NA_character_
ncp_numeric <- as.numeric(as.character(obesity_real$NCP)) # Ensure we treat it as number for logic
obesity_real$NCP_3[ncp_numeric <= 2] <- "Low_1-2"
obesity_real$NCP_3[ncp_numeric == 3] <- "Normal_3"
obesity_real$NCP_3[ncp_numeric >= 4] <- "High_4+"
obesity_real$NCP_3 <- factor(obesity_real$NCP_3, levels = c("Low_1-2", "Normal_3", "High_4+"))
cat("\nMerged NCP_3 Counts:\n")
Merged NCP_3 Counts:
Low_1-2 Normal_3 High_4+
124 359 48
Rationale: Most people consume 3 meals a day (Normal). Eating only 1 or 2 meals likely reflects caloric restriction or meal skipping, while 4+ suggests frequent eating.
Original CALC Counts:
    Always Frequently         no  Sometimes
         1         45        191        294
obesity_real$CALC_3 <- as.character(obesity_real$CALC)
obesity_real$CALC_3[obesity_real$CALC == "no"] <- "No"
obesity_real$CALC_3[obesity_real$CALC == "Sometimes"] <- "Sometimes"
obesity_real$CALC_3[obesity_real$CALC %in% c("Frequently", "Always")] <- "Frequent"
obesity_real$CALC_3 <- factor(obesity_real$CALC_3, levels = c("No", "Sometimes", "Frequent"))
print(table(obesity_real$CALC_3))
No Sometimes Frequent
191 294 46
Rationale: The “Always” category contained negligible observations in the real dataset. We merged “Always” with “Frequently” into a single “Frequent” category.
Continuous variables (Age, Height) are binned into quantile-based categories to avoid assumptions of linear relationships.
make_quantile_bins <- function(x, n_bins = 5, labels = NULL) {
  probs <- seq(0, 1, length.out = n_bins + 1)
  qs <- unique(quantile(x, probs = probs, na.rm = TRUE))
  # Bin by quantile breaks; unique() guards against duplicate break points
  cut(x, breaks = qs, include.lowest = TRUE, labels = labels[1:(length(qs) - 1)])
}
obesity_real$Age_bin5 <- make_quantile_bins( obesity_real$Age,n_bins = 5, labels = c("Very_Young", "Young", "Adult", "Mature", "Older"))
obesity_real$Height_bin5 <- make_quantile_bins( obesity_real$Height, n_bins = 5,labels = c("Very_Short", "Short", "Medium", "Tall", "Very_Tall"))
print(table(obesity_real$Age_bin5))
Very_Young Young Adult Mature Older
110 176 37 113 95
print(table(obesity_real$Height_bin5))
Very_Short      Short     Medium       Tall  Very_Tall
       125        105         96         99        106
Note on Weight: Weight is intentionally excluded from the predictor set because obesity classification is derived from BMI (which uses weight). Including weight would create circular logic and inflate model performance.
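To see the circularity concretely, here is a minimal sketch using the standard WHO BMI cut-offs; `bmi_class` is our own illustrative name, not part of the analysis. A model given both weight and height would simply rediscover this formula.

```r
# BMI = weight (kg) / height (m)^2; the obesity labels derive from this
# quantity, so weight plus height nearly determine the target class.
bmi_class <- function(weight_kg, height_m) {
  bmi <- weight_kg / height_m^2
  cut(bmi, breaks = c(-Inf, 18.5, 25, 30, Inf),
      labels = c("Underweight", "Normal", "Overweight", "Obese"))
}
bmi_class(c(53, 77, 105), c(1.62, 1.80, 1.70))
```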
The original 7-class target variable is grouped into 3 classes to improve model stability given our reduced sample size.
obesity_real$Obesity_3 <- NA_character_
obesity_real$Obesity_3[obesity_real$NObeyesdad %in% c("Insufficient_Weight", "Normal_Weight")] <- "Normal_or_Under"
obesity_real$Obesity_3[obesity_real$NObeyesdad %in% c("Overweight_Level_I", "Overweight_Level_II")] <- "Overweight"
obesity_real$Obesity_3[obesity_real$NObeyesdad %in% c("Obesity_Type_I", "Obesity_Type_II", "Obesity_Type_III")] <- "Obese"
obesity_real$Obesity_3 <- factor(obesity_real$Obesity_3, levels = c("Normal_or_Under", "Overweight", "Obese"))
cat("--- Target Variable Distribution (3 Classes) ---\n")
--- Target Variable Distribution (3 Classes) ---
Normal_or_Under      Overweight           Obese
            327             132              72
prop.table(table(obesity_real$Obesity_3)) * 100
Normal_or_Under      Overweight           Obese
       61.58192        24.85876        13.55932
barplot(table(obesity_real$Obesity_3),
        main = "Target Distribution (3 Classes)",
        col = c("green", "orange", "red"),
        ylab = "Count")
Rationale: With only 531 real observations, the 7-class problem would have insufficient samples per class for reliable modeling. The 3-class grouping ensures adequate class sizes.
set.seed(123)
n <- nrow(obesity_real)
train_size <- floor(0.8 * n)
train_index <- sample(1:n, size = train_size, replace = FALSE)
ob_train <- obesity_real[train_index, ]
ob_test <- obesity_real[-train_index, ]
| Class | Training Set (n=424) | Test Set (n=107) |
|---|---|---|
| Normal_or_Under | 62% | 59.8% |
| Overweight | 25% | 24.3% |
| Obese | 13% | 15.9% |
The similar class distributions confirm that our random split produced representative training and test sets.
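The representativeness check itself can be sketched on any labeled dataset; here we use the built-in iris data as a stand-in for obesity_real, since the point is comparing class proportions across a random 80/20 split.

```r
# Compare class proportions in the training and test partitions of a
# random 80/20 split; similar rows indicate a representative split.
set.seed(123)
n <- nrow(iris)
idx <- sample(seq_len(n), size = floor(0.8 * n))
round(rbind(
  train = prop.table(table(iris$Species[idx])),
  test  = prop.table(table(iris$Species[-idx]))
), 3)
```

If the two rows diverge badly, a stratified split would be the usual remedy; with 531 observations and three classes, a plain random split was sufficient here.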
Based on the transformations above, our final predictor set consists of 15 variables:
allowed_predictors <- c("Gender","Age_bin5","Height_bin5","family_history_with_overweight","FAVC",
"FCVC","CAEC","SMOKE","CH2O","SCC","FAF","TUE","NCP_3","CALC_3","MTRANS_3")
Excluded variables: Weight (circular with the BMI-derived target), raw Age and Height (replaced by their quantile bins), raw NCP, CALC, and MTRANS (replaced by their merged versions), and the original 7-level NObeyesdad target.
target_counts <- table(obesity_real$Obesity_3)
target_pct <- round(prop.table(target_counts) * 100, 1)
bp <- barplot(target_counts, main = "Distribution of Obesity Levels (3 Classes)", col = c("forestgreen", "orange", "firebrick"), ylim = c(0, max(target_counts) * 1.2),ylab = "Count")
text(x = bp, y = target_counts, labels = paste0(target_counts, "\n(", target_pct, "%)"), pos = 3, cex = 0.9)
The distribution of the target variable is clearly imbalanced, so models may naturally favor the majority class. Methods robust to uneven class sizes, such as random forest, may therefore perform better at predicting obesity levels.
par(mfrow = c(1, 3))
gender_counts <- table(obesity_real$Gender)
bp_g <- barplot(gender_counts, main = "Gender Distribution", col = c("lightpink", "lightblue"),ylim = c(0, max(gender_counts) * 1.15))
text(bp_g, gender_counts, labels = gender_counts, pos = 3)
age_counts <- table(obesity_real$Age_bin5)
bp_a <- barplot(age_counts, main = "Age Groups (Quantiles)", col = "gray80",las = 2,cex.names = 0.8)
height_counts <- table(obesity_real$Height_bin5)
bp_h <- barplot(height_counts, main = "Height Groups (Quantiles)", col = "gray80", las = 2, cex.names = 0.8)
Key observations: the sample skews young (the Very_Young and Young bins together hold more than half of the 531 respondents), while the quantile-based height bins are close to equal in size by construction.
par(mfrow = c(2, 3))
barplot(table(obesity_real$FAVC),main = "High Calorie Food (FAVC)", col = "lightblue", las = 1)
barplot(table(obesity_real$FCVC), main = "Vegetable Cons. (FCVC)", xlab = "Level (1=Never, 3=Always)", col = "lightblue", las = 1)
barplot(table(obesity_real$NCP_3), main = "Number of Meals (NCP)", col = "lightblue", las = 1)
barplot(table(obesity_real$CAEC), main = "Snacking Habits (CAEC)", col = "lightblue", las = 1, cex.names = 0.8)
barplot(table(obesity_real$CH2O), main = "Daily Water Intake (CH2O)", xlab = "Level (1=Low, 3=High)", col = "lightblue", las = 1)
barplot(table(obesity_real$CALC_3), main = "Alcohol Cons. (CALC)", col = "lightblue", las = 1)
Key observations: a large majority report frequent high-calorie food consumption (FAVC), three meals per day is the dominant eating pattern (359 of 531), and frequent alcohol consumption is rare (46 respondents).
par(mfrow = c(2, 3))
barplot(table(obesity_real$SMOKE),main = "Smoking Status", col = "lightsalmon", las = 1)
barplot(table(obesity_real$SCC), main = "Calorie Monitoring (SCC)", col = "lightsalmon", las = 1)
barplot(table(obesity_real$FAF), main = "Physical Activity (FAF)", xlab = "Level (0=None, 3=High)", col = "lightsalmon", las = 1)
barplot(table(obesity_real$TUE), main = "Tech Use Time (TUE)", xlab = "Level (0=Low, 2=High)", col = "lightsalmon", las = 1)
barplot(table(obesity_real$MTRANS_3), main = "Transport Mode (MTRANS)", col = "lightgreen", las = 2, cex.names = 0.8)
barplot(table(obesity_real$family_history_with_overweight), main = "Family History (Overweight)", col = "lightsalmon", las = 1)
Key observations: public transport dominates the commuting modes (351 of 531 respondents), and the family-history variable shows the sharpest between-group contrast among these predictors.
Among the lifestyle variables, the family history of overweight is expected to be the most influential: it shows a sharp, meaningful contrast between groups and captures the combined effect of genetic predisposition and a long-term shared environment. Our model results confirm this expectation; the variable appears among the first splits in the decision trees and ranks near the top of the random forest importance measures.
plot_bivariate_base <- function(data, x_var, fill_var, title, col_palette = NULL) {
  tab <- table(data[[x_var]], data[[fill_var]])
  prop_tab <- prop.table(tab, margin = 1)
  if (is.null(col_palette)) { col_palette <- rainbow(ncol(prop_tab)) }
  barplot(t(prop_tab), beside = FALSE, col = col_palette, main = title,
          xlab = "Obesity Class", ylab = "Percentage within Class", ylim = c(0, 1), las = 1)
  legend("topright", legend = rev(colnames(prop_tab)), fill = rev(col_palette),
         title = fill_var, cex = 0.6, bty = "o", bg = "white")
}
par(mfrow = c(2, 2), mar = c(5, 4, 4, 2))
col_pastel1 <- c("#FBB4AE", "#B3CDE3")
plot_bivariate_base(obesity_real, "Obesity_3", "family_history_with_overweight", "Family History by Obesity Class", col_pastel1)
col_blues <- c("#EFF3FF", "#BDD7E7", "#6BAED6", "#2171B5")
plot_bivariate_base(obesity_real, "Obesity_3", "FAF", "Physical Activity (FAF) by Obesity Class", col_blues)
col_greys <- c("#F7F7F7", "#CCCCCC", "#969696", "#525252")
plot_bivariate_base(obesity_real, "Obesity_3", "Age_bin5", "Age Distribution by Obesity Class", col_greys)
col_oranges <- c("#FEE6CE", "#E6550D")
plot_bivariate_base(obesity_real, "Obesity_3", "FAVC", "High Calorie Food Consumption by Obesity Class", col_oranges)
The plots show that high-calorie food consumption is ubiquitous: nearly 90% of the sample, regardless of weight class, reports "Yes." The Obese group shows a slightly higher share of "Yes" responses than the Normal group, but the lack of sharp contrast suggests that what people eat (high-calorie food) is less predictive than how much they eat (NCP) or their genetic background (family history).
Decision trees partition the predictor space into regions through recursive binary splitting. At each node, the algorithm selects the variable and split point that best separates the classes according to an impurity measure.
Splitting Criteria: We compare two impurity measures, deviance (an entropy-based criterion) and the Gini index; the two can select different variables and split points and therefore yield different trees from the same data.
Cross-Validation for Tree Size: Large trees tend to overfit the training data. We use K-fold cross-validation to identify the optimal tree size that minimizes prediction error on held-out data.
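As a sketch of the K-fold mechanism that cross-validation relies on, the fold assignment can be written in a few lines of base R; `kfold_indices` is a hypothetical helper of our own, not a function from the tree package.

```r
# Partition n observations into k roughly equal, randomly assigned folds;
# each fold serves once as the held-out set during cross-validation.
kfold_indices <- function(n, k = 10) {
  split(sample(seq_len(n)), rep(seq_len(k), length.out = n))
}
set.seed(42)
folds <- kfold_indices(424, k = 10)  # 424 = our training-set size
lengths(folds)
```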
Pruning: After determining the optimal size via cross-validation, we prune the tree by removing splits that do not improve cross-validated performance. This produces a simpler, more generalizable model.
R Implementation: We use the tree package, with cv.tree() for cross-validation and prune.misclass() for pruning based on misclassification rate.
Bagging reduces variance by training multiple trees on bootstrap samples of the training data and averaging their predictions.
Key characteristics: each tree is grown on a bootstrap sample of the training data; all predictors are candidates at every split; and the observations left out of each bootstrap sample (out-of-bag, OOB) provide an internal estimate of test error.
R Implementation: We use randomForest() with mtry = p (number of predictors = 15) and ntree = 500.
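To make the bagging mechanism concrete without any packages, here is a hand-rolled sketch in base R. The base learner is a deliberately weak one-variable threshold stump on the built-in iris data; the actual analysis bags full classification trees via randomForest().

```r
# Bagging by hand: B bootstrap samples, one simple classifier each,
# majority vote across all classifiers.
set.seed(1)
d <- iris[iris$Species != "setosa", ]   # two-class problem for simplicity
d$Species <- droplevels(d$Species)
B <- 101
votes <- replicate(B, {
  boot <- d[sample(nrow(d), replace = TRUE), ]
  # Stump: split Petal.Width at the midpoint of the two class means
  thr <- mean(tapply(boot$Petal.Width, boot$Species, mean))
  ifelse(d$Petal.Width > thr, "virginica", "versicolor")
})
pred <- apply(votes, 1, function(v) names(which.max(table(v))))
mean(pred == as.character(d$Species))  # ensemble accuracy
```

Each individual stump is noisy because its threshold depends on the bootstrap draw; averaging the votes stabilizes the decision boundary, which is exactly the variance reduction bagging provides.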
Random Forest extends bagging by introducing additional randomness: at each split, only a random subset of predictors is considered.
Key characteristics: restricting each split to a random subset of predictors decorrelates the trees, which typically reduces the variance of the aggregated prediction beyond what bagging achieves.
R Implementation: We use randomForest() with default mtry (√p for classification) and ntree = 500.
train_data <- ob_train[, c("Obesity_3", allowed_predictors)]
test_data <- ob_test[, c("Obesity_3", allowed_predictors)]
tree_deviance <- tree(Obesity_3 ~ ., data = train_data, split = "deviance")
summary(tree_deviance)
Classification tree:
tree(formula = Obesity_3 ~ ., data = train_data, split = "deviance")
Variables actually used in tree construction:
[1] "Age_bin5" "CAEC" "Height_bin5"
[4] "FAVC" "CH2O" "CALC_3"
[7] "family_history_with_overweight" "TUE" "NCP_3"
[10] "FAF"
Number of terminal nodes: 18
Residual mean deviance: 1.319 = 535.5 / 406
Misclassification error rate: 0.2642 = 112 / 424
Our deviance tree predicts obesity classification using 18 terminal nodes based on 10 variables. The model correctly classifies about 73.6% of training cases, an error rate of roughly 26.4%. Age is the first split; the model then uses eating frequency, height, and water consumption for further decisions. The error rate shows that obesity is hard to predict from these features alone because different obesity groups share similar characteristics.
cv_deviance <- cv.tree(tree_deviance, FUN = prune.misclass)
cv_deviance
$size
[1] 18 16 10 6 5 1
$dev
[1] 158 157 151 154 156 162
$k
[1] -Inf 0.00 1.00 3.50 4.00 6.25
$method
[1] "misclass"
attr(,"class")
[1] "prune" "tree.sequence"
plot(cv_deviance$size, cv_deviance$dev, type = "b", pch = 19, col = "darkred", lwd = 2,xlab = "Tree Size (Number of Terminal Nodes)", ylab = "CV Misclassification Error",main = "Cross-Validation: Error vs Complexity")
grid()
best_size_dev <- cv_deviance$size[which.min(cv_deviance$dev)]
cat("\nOptimal Tree Size (Nodes):", best_size_dev, "\n")
Optimal Tree Size (Nodes): 10
Cross-validation was used to find the best tree size that balances accuracy and simplicity. The results show that a tree with 10 nodes has the lowest cross-validation error of 151 misclassifications. This is better than the original tree with 18 nodes, which had 158 errors. A smaller tree is easier to understand and works better on new data because it avoids overfitting. The optimal tree size of 10 nodes removes unnecessary splits while keeping good prediction performance.
tree_deviance_pruned <- prune.misclass(tree_deviance, best = best_size_dev)
plot(tree_deviance_pruned)
text(tree_deviance_pruned, pretty = 0, cex = 0.8, col = "blue")
The pruned tree with 10 nodes is simpler and easier to interpret than the original tree. The main splits are still age, eating frequency (CAEC), and height, which are the most important factors for obesity. For younger people who do not eat outside frequently, height determines whether they are normal weight or overweight. For older people, the tree checks family history, technology use time (TUE), number of meals (NCP_3), and physical activity (FAF) to make the final prediction. The pruned tree removes complex branches that do not help much with prediction, making it a cleaner model that is better for understanding obesity patterns.
pred_class_dev <- predict(tree_deviance_pruned, newdata = test_data, type = "class")
pred_prob_dev <- predict(tree_deviance_pruned, newdata = test_data, type = "vector")
conf_matrix_dev <- table(Predicted = pred_class_dev, Actual = test_data$Obesity_3)
print(conf_matrix_dev)
                 Actual
Predicted Normal_or_Under Overweight Obese
Normal_or_Under 62 18 10
Overweight 2 6 4
Obese 0 2 3
total_error_dev <- mean(pred_class_dev != test_data$Obesity_3)
class_errors_dev <- 1 - diag(conf_matrix_dev) / colSums(conf_matrix_dev)
roc_dev <- multiclass.roc(test_data$Obesity_3, pred_prob_dev)
auc_dev <- auc(roc_dev)
Performance Metrics:
| Metric | Value |
|---|---|
| Total Error Rate | 33.64% |
| Normal_or_Under Error | 3.12% |
| Overweight Error | 76.92% |
| Obese Error | 82.35% |
| Multiclass AUC | 0.6489 |
On the test set, the model has a total error rate of 33.64%, meaning it correctly predicts about 66% of cases. The model performs well at identifying normal weight people, with only 3.12% error. However, it struggles with overweight and obese groups, with error rates of 76.92% and 82.35%. This means the model often mistakes overweight and obese people as normal weight. The multiclass AUC score of 0.65 shows moderate performance. The confusion matrix reveals that most prediction errors come from confusing overweight and obese cases with normal weight, suggesting the model needs better ability to distinguish between these heavier weight groups.
tree_gini <- tree(Obesity_3 ~ ., data = train_data, split = "gini")
summary(tree_gini)
Classification tree:
tree(formula = Obesity_3 ~ ., data = train_data, split = "gini")
Variables actually used in tree construction:
[1] "Age_bin5" "CH2O" "family_history_with_overweight"
[4] "SCC" "Gender" "Height_bin5"
[7] "MTRANS_3" "FCVC" "CALC_3"
[10] "FAF" "TUE" "CAEC"
[13] "NCP_3" "FAVC"
Number of terminal nodes: 53
Residual mean deviance: 0.9789 = 363.2 / 371
Misclassification error rate: 0.2052 = 87 / 424
set.seed(2024)
cv_gini <- cv.tree(tree_gini, FUN = prune.misclass)
plot(cv_gini$size, cv_gini$dev, type = "b", pch = 19, col = "darkblue", lwd = 2,xlab = "Tree Size", ylab = "CV Misclassification Error",main = "CV Error (Gini Split)")
grid()
best_size_gini <- cv_gini$size[which.min(cv_gini$dev)]
cat("\nOptimal Tree Size (Gini):", best_size_gini, "\n")
Optimal Tree Size (Gini): 23
tree_gini_pruned <- prune.misclass(tree_gini, best = best_size_gini)
plot(tree_gini_pruned)
text(tree_gini_pruned, pretty = 0, cex = 0.8, col = "darkgreen")
pred_class_gini <- predict(tree_gini_pruned, newdata = test_data, type = "class")
pred_prob_gini <- predict(tree_gini_pruned, newdata = test_data, type = "vector")
conf_matrix_gini <- table(Predicted = pred_class_gini, Actual = test_data$Obesity_3)
print(conf_matrix_gini)
                 Actual
Predicted Normal_or_Under Overweight Obese
Normal_or_Under 55 16 5
Overweight 7 8 4
Obese 2 2 8
total_error_gini <- mean(pred_class_gini != test_data$Obesity_3)
class_errors_gini <- 1 - diag(conf_matrix_gini) / colSums(conf_matrix_gini)
roc_gini <- multiclass.roc(test_data$Obesity_3, pred_prob_gini)
auc_gini <- auc(roc_gini)
Performance Metrics:
| Metric | Value |
|---|---|
| Total Error Rate | 33.64% |
| Normal_or_Under Error | 14.06% |
| Overweight Error | 69.23% |
| Obese Error | 52.94% |
| Multiclass AUC | 0.7377 |
A second tree model was built using Gini instead of deviance as the splitting criterion. The Gini tree created 53 nodes before pruning, with a training error rate of 20.52%, which is better than the deviance tree. Cross-validation showed that the optimal tree size is 23 nodes. This larger tree uses more variables including water intake, family history, gender, transportation method (MTRANS_3), and vegetable consumption (FCVC), showing that Gini considers different factors important. On the test set, the Gini tree has the same total error rate of 33.64% as the deviance tree, but the error distribution is different. It has higher error for normal weight people (14.06%) but lower errors for overweight (69.23%) and obese (52.94%) groups. The multiclass AUC of 0.74 is better than the deviance tree’s 0.65, indicating the Gini tree is more effective at distinguishing between obesity classes.
set.seed(208)
p <- length(allowed_predictors)
bag_model <- randomForest(Obesity_3 ~ ., data = train_data, mtry = p, ntree = 500, importance = TRUE)
cat("--- Bagging Model Summary ---\n")
--- Bagging Model Summary ---
Call:
randomForest(formula = Obesity_3 ~ ., data = train_data, mtry = p, ntree = 500, importance = TRUE)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 15
OOB estimate of error rate: 31.6%
Confusion matrix:
Normal_or_Under Overweight Obese class.error
Normal_or_Under 238 19 6 0.09505703
Overweight 51 44 11 0.58490566
Obese 36 11 8 0.85454545
Bagging was used to improve predictions by combining 500 decision trees. In bagging, each tree is built from a random sample of the training data, and predictions are made by averaging results across all trees. The bagging model uses all 15 predictor variables at each split. The out-of-bag (OOB) error rate is 31.6%, which is lower than the single tree models. The error rates for each class show that bagging performs well for normal weight people (9.5% error) but still struggles with overweight (58.5% error) and obese (85.5% error) groups. Bagging reduces overfitting compared to single trees, but the class imbalance problem remains—the model is good at identifying normal weight people but makes many mistakes on heavier groups.
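The OOB estimate works because each bootstrap sample leaves out roughly a third of the training rows. A quick Monte Carlo sketch (n = 424 matches our training-set size) confirms the well-known ≈36.8% figure:

```r
# The fraction of rows absent from a bootstrap sample of size n is about
# (1 - 1/n)^n, which approaches exp(-1) ≈ 0.368 as n grows.
set.seed(7)
n <- 424
oob_frac <- replicate(1000, mean(!(seq_len(n) %in% sample(n, replace = TRUE))))
round(mean(oob_frac), 3)
```

Because every observation is out-of-bag for roughly a third of the 500 trees, aggregating those trees' votes yields an honest error estimate without a separate validation set.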
par(mfrow = c(1, 2))
varImpPlot(bag_model, main = "Bagging: Variable Importance")
par(mfrow = c(1, 1))
--- Variable Importance ---
Normal_or_Under Overweight Obese MeanDecreaseAccuracy MeanDecreaseGini
Gender 11.018457 1.9209436 4.5121991 11.656263 6.841538
Age_bin5 20.788830 16.8762955 10.9801202 28.551234 30.976471
Height_bin5 13.058442 14.6259492 4.5438254 19.821359 30.445264
family_history_with_overweight 8.038287 4.8179263 13.0884633 12.172318 9.617167
FAVC 8.708945 -0.2074888 4.2528992 8.402860 7.738184
FCVC 3.635740 10.1500667 2.5528119 9.269282 16.206663
CAEC 4.317601 17.4089668 6.1233649 15.335813 19.619300
SMOKE 6.137518 -2.4774282 2.5347723 4.585786 5.655272
CH2O 5.473174 12.0804602 14.3994823 16.028314 19.248949
SCC 8.955943 7.7995747 1.7889535 10.826806 4.665895
FAF 5.022509 2.3508167 10.1085902 8.967674 22.528533
TUE 6.473534 6.7554121 3.6319446 9.778572 16.398927
NCP_3 3.998360 7.5202190 3.7027033 8.320579 11.075586
CALC_3 -1.310107 4.5777954 9.8395815 5.384242 14.646874
MTRANS_3 4.197121 4.9049469 -0.1114467 6.348510 10.599633
Interpretation:
Age and height are the most important variables, ranking highest in both accuracy decrease and Gini decrease measures. Water intake (CH2O) and eating frequency (CAEC) are also important, especially for predicting obese cases. Family history of overweight matters more for obese prediction than for normal weight prediction. Physical activity (FAF) is useful for predicting obese cases. Gender and smoking are less important overall. The importance rankings show that demographic factors like age and height are the strongest predictors of obesity, followed by behavioral factors like water intake and eating habits. This matches what the single trees showed—age is always the first split, suggesting it is the strongest dividing factor for obesity classification.
pred_bag <- predict(bag_model, newdata = test_data, type = "class")
pred_prob_bag <- predict(bag_model, newdata = test_data, type = "prob")
conf_matrix_bag <- table(Predicted = pred_bag, Actual = test_data$Obesity_3)
print(conf_matrix_bag)
                 Actual
Predicted Normal_or_Under Overweight Obese
Normal_or_Under 58 17 5
Overweight 4 6 6
Obese 2 3 6
total_error_bag <- mean(pred_bag != test_data$Obesity_3)
class_errors_bag <- 1 - diag(conf_matrix_bag) / colSums(conf_matrix_bag)
roc_bag <- multiclass.roc(test_data$Obesity_3, pred_prob_bag)
auc_bag <- auc(roc_bag)
Performance Metrics:
| Metric | Value |
|---|---|
| OOB Error Rate (Training) | 31.6% |
| Total Error Rate | 34.58% |
| Normal_or_Under Error | 9.38% |
| Overweight Error | 76.92% |
| Obese Error | 64.71% |
| Multiclass AUC | 0.7207 |
On the test set, bagging has a total error rate of 34.58%, which is slightly worse than the training OOB error of 31.6%, suggesting some overfitting. The model performs well on normal weight people with only 9.38% error, but struggles with overweight (76.92% error) and obese (64.71% error) groups. The multiclass AUC of 0.72 is good and matches the Gini tree’s performance. Compared to single tree models, bagging makes fewer mistakes on obese cases (64.71% vs 82.35% for deviance tree, 52.94% for Gini tree) but has higher error on normal weight cases (9.38% vs 3.12% for deviance tree, 14.06% for Gini tree). Overall, bagging improves predictions for harder-to-classify groups like obese, but the class imbalance issue remains—the model is still better at identifying normal weight people than heavier groups.
```r
set.seed(208)
p <- length(allowed_predictors)
mtry_rf <- floor(sqrt(p))
cat("Number of predictors (p):", p, "\n")
cat("mtry for Random Forest:", mtry_rf, "\n")
```

```
Number of predictors (p): 15
mtry for Random Forest: 3
```
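The square-root rule above gives mtry = 3, but this is a convention rather than an optimum. A small sketch (not part of the original analysis, assuming `train_data` and `Obesity_3` as above) can compare OOB error across candidate mtry values before committing to one:

```r
library(randomForest)

# Compare final OOB error for several mtry values; mtry = 15
# reproduces bagging, mtry = 3 is the sqrt(p) default.
set.seed(208)
candidates <- c(2, 3, 5, 8, 15)
oob_by_mtry <- sapply(candidates, function(m) {
  fit <- randomForest(Obesity_3 ~ ., data = train_data,
                      mtry = m, ntree = 300)
  tail(fit$err.rate[, "OOB"], 1)   # OOB error after the last tree
})
names(oob_by_mtry) <- candidates
round(oob_by_mtry, 4)
```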
```r
rf_model <- randomForest(
  Obesity_3 ~ .,
  data = train_data,
  mtry = mtry_rf,
  ntree = 500,
  importance = TRUE
)
cat("--- Random Forest Model Summary ---\n")
print(rf_model)
```

```
--- Random Forest Model Summary ---

Call:
 randomForest(formula = Obesity_3 ~ ., data = train_data, mtry = mtry_rf, ntree = 500, importance = TRUE)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 3

        OOB estimate of error rate: 31.13%
Confusion matrix:
                Normal_or_Under Overweight Obese class.error
Normal_or_Under             247         12     4   0.0608365
Overweight                   68         35     3   0.6698113
Obese                        33         12    10   0.8181818
```
Random forest samples only 3 candidate variables at each split instead of all 15 as in bagging. This forces the trees to consider different variables and reduces the correlation between them, which can improve the ensemble's predictions. The forest grew 500 trees with an OOB error rate of 31.13%, slightly better than bagging's 31.6%. The per-class OOB errors show that random forest handles normal-weight cases well (6.08% error), similar to bagging. However, errors for the overweight (66.98%) and obese (81.82%) groups are higher than bagging's, suggesting that restricting each split to 3 variables makes the heavier groups harder to separate.

### Variable Importance
```
--- Variable Importance ---
                               Normal_or_Under Overweight     Obese MeanDecreaseAccuracy MeanDecreaseGini
Gender                                9.926345  4.7722536  5.602449            11.991152         7.787860
Age_bin5                             18.727391 14.6128396  9.703711            24.735348        29.728226
Height_bin5                          11.757727  9.9748965  6.209251            16.422311        24.676735
family_history_with_overweight        8.102230  9.9517472 12.362316            14.961899         9.787452
FAVC                                  6.181116  1.7531174  3.827956             7.090257         8.254874
FCVC                                  4.742707  7.9106164  2.645394             8.577288        12.397965
CAEC                                  5.857850 15.7476444  4.821944            15.439690        16.848332
SMOKE                                 7.186437  0.1811882  3.195207             6.566445         5.002481
CH2O                                  8.712491 11.4174832 13.806336            16.712332        16.514889
SCC                                   4.737135  4.0154324  2.713076             6.563664         4.994514
FAF                                   5.595125  5.1553089  9.766466            10.480191        17.623271
TUE                                   6.217586  7.6423821  7.083067            10.721153        13.789949
NCP_3                                 2.943702 12.1303269  4.277942            11.478727        11.514739
CALC_3                                3.528662  6.0691509  8.088388             8.465518        13.374823
MTRANS_3                              3.861899  4.6649152  2.137497             6.011020        11.444024
```
Interpretation: The variable importance in random forest shows similar patterns to bagging, with age and height as the top factors. However, the rankings change slightly because random forest only considers 3 variables per split. Water intake (CH2O) becomes more important in random forest, suggesting that when age and height are not available in a split, water intake is a useful alternative. Eating frequency (CAEC), family history, and physical activity (FAF) remain important.
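The rankings described above can be obtained programmatically rather than read off the raw table. A sketch assuming `rf_model` from the chunk above (column names follow `randomForest::importance`):

```r
library(randomForest)

# Rank predictors by MeanDecreaseGini and show the top five
imp <- importance(rf_model)
top5 <- imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), ]
head(rownames(top5), 5)
```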
```r
# Predict on the test set with the random forest
pred_rf <- predict(rf_model, newdata = test_data, type = "class")
pred_prob_rf <- predict(rf_model, newdata = test_data, type = "prob")
conf_matrix_rf <- table(Predicted = pred_rf, Actual = test_data$Obesity_3)
print(conf_matrix_rf)
```

```
                 Actual
Predicted         Normal_or_Under Overweight Obese
  Normal_or_Under              62         19     5
  Overweight                    1          5     6
  Obese                        1          2     6
```

```r
total_error_rf <- mean(pred_rf != test_data$Obesity_3)
class_errors_rf <- 1 - diag(conf_matrix_rf) / colSums(conf_matrix_rf)
roc_rf <- multiclass.roc(test_data$Obesity_3, pred_prob_rf)
auc_rf <- auc(roc_rf)
```

Performance Metrics:
| Metric | Value |
|---|---|
| OOB Error Rate (Training) | 31.13% |
| Total Error Rate | 31.78% |
| Normal_or_Under Error | 3.12% |
| Overweight Error | 80.77% |
| Obese Error | 64.71% |
| Multiclass AUC | 0.7503 |
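The multiclass AUC of 0.75 averages over class pairs, so it hides which class drives the discrimination. Per-class one-vs-rest AUCs make this visible; a sketch assuming `test_data` and `pred_prob_rf` from the chunk above, using pROC's binary `roc()`:

```r
library(pROC)

# One-vs-rest AUC per class: how well does the forest separate
# each class from the other two?
classes <- colnames(pred_prob_rf)
auc_ovr <- sapply(classes, function(cl) {
  as.numeric(auc(roc(
    response  = as.integer(test_data$Obesity_3 == cl),
    predictor = pred_prob_rf[, cl],
    quiet     = TRUE
  )))
})
round(auc_ovr, 3)
```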
```r
summary_df <- data.frame(
  Model = c("Tree (Deviance)", "Tree (Gini)", "Bagging", "Random Forest"),
  Total_Error = c(total_error_dev, total_error_gini, total_error_bag, total_error_rf),
  Normal_Error = c(class_errors_dev["Normal_or_Under"], class_errors_gini["Normal_or_Under"],
                   class_errors_bag["Normal_or_Under"], class_errors_rf["Normal_or_Under"]),
  Overweight_Error = c(class_errors_dev["Overweight"], class_errors_gini["Overweight"],
                       class_errors_bag["Overweight"], class_errors_rf["Overweight"]),
  Obese_Error = c(class_errors_dev["Obese"], class_errors_gini["Obese"],
                  class_errors_bag["Obese"], class_errors_rf["Obese"]),
  AUC = c(auc_dev, auc_gini, auc_bag, auc_rf)
)

# Format for display
summary_df_print <- summary_df
summary_df_print[, 2:5] <- lapply(summary_df_print[, 2:5],
                                  function(x) paste0(round(x * 100, 1), "%"))
summary_df_print$AUC <- round(summary_df$AUC, 4)
print(summary_df_print)
```

```r
# Grouped bar chart of error rates by model and error type
error_matrix <- as.matrix(summary_df[, c("Total_Error", "Normal_Error",
                                         "Overweight_Error", "Obese_Error")])
rownames(error_matrix) <- summary_df$Model
col_palette <- c("#D73027", "#FC8D59", "#FEE090", "#91BFDB")
par(mar = c(7, 4, 4, 8), xpd = TRUE)
barplot(t(error_matrix) * 100, beside = TRUE, col = col_palette,
        main = "Model Error Comparison", ylab = "Error Rate (%)",
        ylim = c(0, max(error_matrix) * 120), las = 2, cex.names = 0.8)
legend("topright", inset = c(-0.25, 0),
       legend = c("Total", "Normal", "Overweight", "Obese"),
       fill = col_palette, title = "Error Type", cex = 0.8, bty = "n")

# Bar chart of AUC by model
barplot(summary_df$AUC, names.arg = summary_df$Model,
        col = c("coral", "orange", "steelblue", "darkblue"),
        main = "Model AUC Comparison", ylab = "AUC",
        ylim = c(0, 1), las = 2, cex.names = 0.8)
abline(h = 0.5, lty = 2, col = "darkgray")
```

```r
best_idx <- which.min(summary_df$Total_Error)
best_model <- summary_df$Model[best_idx]
cat("\n===== BEST MODEL SELECTION =====\n\n")
cat("Lowest Total Error: ", summary_df$Model[which.min(summary_df$Total_Error)],
    "(", round(min(summary_df$Total_Error) * 100, 1), "%)\n")
cat("Highest AUC: ", summary_df$Model[which.max(summary_df$AUC)],
    "(", round(max(summary_df$AUC), 4), ")\n")
cat("Best Overweight Class: ", summary_df$Model[which.min(summary_df$Overweight_Error)],
    "(", round(min(summary_df$Overweight_Error) * 100, 1), "%)\n")
```

```
===== BEST MODEL SELECTION =====

Lowest Total Error:  Random Forest ( 31.8 %)
Highest AUC:  Random Forest ( 0.7503 )
Best Overweight Class:  Tree (Gini) ( 69.2 %)
>> Selected Best Model: Random Forest
```
Comparing all four models on the test set, random forest performs best, with the lowest total error rate (31.8%) and the highest AUC (0.75). It identifies normal-weight cases with 3.1% error, matching the deviance tree, and its obese error of 64.7% matches bagging. The Gini tree remains the best on overweight cases (69.2% error versus 80.8% for random forest).

Each model occupies a different point on the complexity-performance trade-off. The deviance tree is simple and interpretable with 10 nodes but has the lowest AUC (0.65). The Gini tree is larger (23 nodes) and classifies overweight cases better, but is worse overall. Bagging uses all 15 variables at each split and achieves moderate performance (AUC 0.72). Random forest samples only 3 variables per split, which yields the lowest total error (31.8%) and highest AUC (0.75). Across all models, the main difficulty is separating the overweight and obese classes, which share many characteristics; demographic factors such as age and height, followed by behavioral factors such as water intake and eating frequency, are consistently the strongest predictors. Random forest is selected as the best model because it offers the best trade-off between accuracy, class discrimination, and generalization to new data.
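The 31.8% vs. 33.6% comparison above rests on a single train/test split, so a quick stability check is to repeat the split several times. This sketch is not part of the original analysis; it assumes the full cleaned dataset is available as a data frame named `obesity_data` (a hypothetical name chosen here for illustration) with the same `Obesity_3` response:

```r
library(randomForest)

# Repeated-split sketch: does random forest's edge persist across
# resampled 80/20 train/test partitions?
set.seed(208)
errs <- replicate(10, {
  idx   <- sample(nrow(obesity_data), size = 0.8 * nrow(obesity_data))
  fit   <- randomForest(Obesity_3 ~ ., data = obesity_data[idx, ],
                        mtry = 3, ntree = 300)
  preds <- predict(fit, newdata = obesity_data[-idx, ])
  mean(preds != obesity_data$Obesity_3[-idx])
})
c(mean = mean(errs), sd = sd(errs))  # spread of test error across splits
```

If the standard deviation across splits is comparable to the gap between models, the ranking should be reported with caution.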
```r
# OOB error as a function of the number of trees, for both ensembles
par(mfrow = c(1, 1))
plot(1:500, bag_model$err.rate[, "OOB"], type = "l", col = "red", lwd = 2,
     ylim = range(c(bag_model$err.rate[, "OOB"], rf_model$err.rate[, "OOB"])),
     xlab = "Number of Trees", ylab = "OOB Error Rate",
     main = "Bagging vs Random Forest: OOB Error Convergence")
lines(1:500, rf_model$err.rate[, "OOB"], col = "blue", lwd = 2)
legend("topright",
       legend = c(paste0("Bagging (mtry=", p, ")"),
                  paste0("Random Forest (mtry=", mtry_rf, ")")),
       col = c("red", "blue"), lwd = 2, bty = "n")
grid()
```

The OOB error convergence plot shows how bagging and random forest improve as trees are added. Both models start with error rates around 45% with few trees, then improve quickly. Bagging converges to about 31.6% error and random forest to about 31.1%. Random forest reaches low error with fewer trees, suggesting it learns more efficiently. By around 100 trees both models have stabilized, so adding more trees beyond this point provides little improvement. The red bagging curve sits slightly above the blue random forest curve throughout, showing that restricting each split to 3 random variables yields a small but consistent advantage over using all 15.
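The "stabilized by around 100 trees" reading can be made precise from the stored error trace. A sketch assuming `rf_model` from above, with an illustrative tolerance of 0.5 percentage points:

```r
# First tree count from which the OOB error never again leaves a
# +/- 0.005 band around its final value.
oob       <- rf_model$err.rate[, "OOB"]
within    <- abs(oob - tail(oob, 1)) <= 0.005
stable_at <- if (all(within)) 1L else max(which(!within)) + 1L
stable_at
```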
Ensemble methods like bagging and random forest consistently outperform single decision trees. The deviance tree has a total error rate of 33.6% and an AUC of 0.65, while random forest achieves 31.8% error and an AUC of 0.75. Bagging and random forest reduce overfitting by combining predictions from many trees built on different bootstrap samples, and random forest improves further by considering only 3 candidate variables per split, which reduces correlation between trees and increases their diversity.

Age and height are the most important predictors across all models. Both demographic factors rank highest on every importance measure, indicating they are fundamental to obesity classification in this dataset; age-related physiological differences and body structure (height) appear to be the primary drivers of predicted risk. Water intake (CH2O) and eating frequency (CAEC) come next, showing that behavioral factors also play a major role; family history of overweight matters for obese predictions, while gender and smoking are less important. This consistency across models suggests these variables capture real patterns in obesity and should be the focus of intervention strategies.

All models struggle more with the overweight and obese classes than with normal weight. The deviance tree has only 3.1% error for normal weight but 76.9% for overweight and 82.4% for obese. This imbalance occurs because overweight and obese individuals share similar characteristics, making them hard to distinguish, and the overweight class also has fewer samples (132 versus 327 for normal weight), creating a class-imbalance problem. Random forest reduces the obese error to 64.7%, but overweight remains difficult at 80.8%; the Gini tree performs best on overweight (69.2% error) because its larger size allows more complex decision boundaries.
Age and height are the strongest predictors of obesity, consistently ranking first and second across all models. Age matters because obesity risk changes with life stage: younger people tend to have different metabolic rates and lifestyles than older adults. Height is important because BMI, and the obesity classes derived from it, depend on body measurements relative to height. Together, these demographic factors explain most of the variation in obesity classification.

Water intake (CH2O) is the most important behavioral factor. People who drink more water are classified as lower risk in this data, possibly because water replaces high-calorie beverages and aids satiety. Eating between meals (CAEC) is also important, suggesting that eating patterns affect obesity, and physical activity (FAF) helps distinguish obese individuals from the other groups.

Gender, smoking, and transportation method have lower importance scores. Gender shows some effect but is less predictive than age and height. Smoking has minimal importance, suggesting it is not a strong obesity indicator in this dataset. Transportation method (MTRANS_3) ranks lowest, meaning that whether people walk, drive, or use public transport has little effect on obesity prediction after accounting for other factors. The number of main meals (NCP_3) and alcohol consumption (CALC_3) show moderate importance but remain weaker than the top predictors.
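The direction of the age effect described above can be inspected with a partial dependence plot, which shows how the forest's predicted score for one class changes across the values of a single predictor. A sketch assuming `rf_model` and `train_data` from above, and that `Age_bin5` is the binned age variable used in the models:

```r
library(randomForest)

# Partial dependence of the 'Obese' class on binned age, averaging
# over the observed values of all other predictors.
partialPlot(rf_model, pred.data = train_data,
            x.var = "Age_bin5", which.class = "Obese",
            main = "Partial dependence of 'Obese' on age")
```

The vertical axis is on the logit scale, so relative differences across age bins matter more than absolute values.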
This analysis compared four tree-based models for obesity classification: single decision trees with deviance and Gini splitting, bagging, and random forest. Random forest emerged as the best model, with 31.8% total error and an AUC of 0.75. Age and height are the strongest predictors of obesity, followed by behavioral factors such as water intake and eating frequency. Ensemble methods clearly outperform single trees by reducing overfitting and combining diverse predictions. The main remaining challenge is classifying overweight cases, owing to their overlap with the normal and obese groups and their smaller sample size. Random forest balances accuracy and efficiency by sampling only 3 variables per split, converging faster than bagging while still providing interpretable variable-importance measures.
1. This analysis is for educational and research purposes only. The models developed here should not be used for clinical diagnosis or medical decision-making without validation by qualified healthcare professionals. Obesity is a complex condition influenced by genetic, environmental, behavioral, and medical factors not fully captured by the available data, and individual predictions from these models may not accurately reflect actual obesity risk. The dataset is primarily synthetic and may not represent real populations, so results should not be generalized beyond the specific population and features studied. Healthcare providers should use their clinical judgment and evidence-based guidelines for obesity assessment and treatment rather than relying solely on machine learning predictions. This analysis assumes data quality and completeness without extensive data validation. Users of these models assume all responsibility for appropriate application and interpretation of results.

2. This analysis uses elements not covered in the MQT7015 labs. pROC package: the course uses the ROCR package for ROC analysis, which only supports binary classification; since the target variable has 3 classes (Normal_or_Under, Overweight, Obese), the pROC package's multiclass.roc() function was used to compute AUC for multi-class problems. ggplot2 package: listed for potential visualization but minimally used.
[1] Mendoza Palechor, F., & de la Hoz Manotas, A. (2019). Dataset for estimation of obesity levels based on eating habits and physical condition in individuals from Colombia, Peru and Mexico. Data in Brief, 25, 104344. https://doi.org/10.1016/j.dib.2019.104344

[2] Cremona, M. A., & Severino, F. (2024). MQT7015-Lab6: Arbres de décision [Decision Trees Lab]. Course materials for MQT7015.

[3] Cremona, M. A., & Severino, F. (2024). MQT7015-Lab7: Forêts aléatoires et agrégation des modèles [Random Forests and Model Aggregation Lab]. Course materials for MQT7015.